Strategies for Reprocessing Aggregated Metadata

Authors

  • Muriel Foulonneau
  • Timothy W. Cole
Abstract

The OAI protocol facilitates the aggregation of large numbers of heterogeneous metadata records. To make harvested records usable in the context of an OAI service provider, the records typically must be filtered, analyzed, and transformed. The CIC metadata portal harvests 450,000 records from 18 repositories at 9 U.S. Midwestern universities. The process implemented for transforming metadata records for this project supports multiple workflows and end-user interfaces. The design of the metadata transformation process required trade-offs between aggregation homogeneity and utility for purpose on the one hand and pragmatic constraints such as feasibility, human resources, and processing time on the other.

1 Aggregating Metadata Describing Scholarly Resources

In recent years, large aggregations of metadata describing heterogeneous resources have been created using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI service providers who build applications on top of such aggregations must amalgamate large amounts of metadata harvested in a range of formats. By reprocessing harvested metadata, service providers can adapt metadata for their specific use and present those metadata to end users in an integrated fashion. The process of adapting metadata for an application other than the one envisioned when the metadata records were created, i.e., repurposing metadata, requires analyzing the harvested metadata, identifying processes to apply to the metadata, and then building the reprocessing system to select, transform, and organize the metadata. The present paper discusses issues related to metadata analysis and the implementation of a metadata reprocessing system. It suggests a range of strategies for metadata reprocessing and adaptation and identifies issues needing further study.

1.1 The CIC Portal: an Aggregation of 450,000 Metadata Records

The CIC metadata portal is a major metadata aggregation encompassing digital resources from 9 Midwestern universities in the U.S., mostly holdings of academic research libraries. It provides access to 450,000 descriptive metadata records from 152 defined collections.

The CIC metadata portal has three main interfaces. A primary search and retrieval interface provides classic digital library access points for scholars (author, title, subject, and type) along with a number of filtering and grouping functions based on dates and collections. A clickable geographic map allows users to browse by the spatial coverage attributes of the resources indexed. Finally, a second search and retrieval interface takes advantage of both collection-level and item-level descriptions in concert [3]. The interfaces were designed to improve information retrieval in aggregated collections, improve usability of heterogeneous information, and demonstrate the wealth of U.S. Midwestern digital library resources.

To enable these three interfaces, metadata is reprocessed and then ingested by two distinct systems: the DLXS software developed by the University of Michigan on top of the OpenText XPat search engine, and a Microsoft SQL database. We describe below the implementation of our initial metadata reprocessing system, detailing the workflow from harvesting to data publishing.

1.2 Metadata Reprocessing: Challenges and Objectives

"A metadata record is created in the objective of a specific use." [2] The challenge of repurposing metadata records is to reuse records created to fit one context in a different context with different constraints and objectives. Any attempt to reuse descriptive metadata runs the risk of misusing (misunderstanding) those records.

The problem is exacerbated because, even assuming that metadata records are generally well-adapted to the original context for which they were created, this original context typically remains partially or totally unknown to service providers. Moreover, OAI-PMH is designed to facilitate the harvesting of the same metadata records by multiple service providers, each likely to have its own unique purpose and context. However, since metadata records are expensive and resource-consuming to generate, their reusability has great potential for benefit and is at the core of a number of current initiatives, notably the National Science Digital Library and the Digital Library Federation working group on best practices for OAI and shareable metadata.

These considerations dictate a thorough analysis of harvested metadata, evaluated in terms of the service provider's context and typically applying measures of metadata quality and utility [9] different from those applied when the metadata records were originally created. Metadata records adequate in their original local context may not be adequate in an aggregated context. In reprocessing harvested metadata, the service provider's challenge is to implement strategies that transform harvested metadata in a way that enhances usefulness while avoiding misuse or misunderstanding.

1.3 Technical Implementation of Metadata Reprocessing

For the CIC portal, the harvest of the OAI metadata provider repositories (18 in all) is customized according to a number of configurable parameters, including the strictness of XML validation, the specific sets to harvest, and the metadata format to harvest. Once harvested, records are first processed through a program that selects the records relevant to the purposes of the aggregation. Selected records are then sent through an XSL pipeline (a chain of XSLT files), which implements a series of transformations.
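
As an illustration of the per-repository harvest parameters described above, the following Python sketch builds the initial OAI-PMH `ListRecords` request for each configured set. It is not the CIC portal's actual harvester: the `HarvestConfig` class, its field names, and the example endpoint URL are all assumptions made for this example. Only the request arguments (`verb`, `metadataPrefix`, `set`) come from the OAI-PMH protocol itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlencode


@dataclass
class HarvestConfig:
    """Per-repository harvest settings (hypothetical field names)."""
    base_url: str                      # OAI-PMH endpoint of the data provider
    metadata_prefix: str = "oai_dc"    # metadata format to request
    # Sets to harvest; a single None means "harvest the whole repository".
    sets: List[Optional[str]] = field(default_factory=lambda: [None])
    # Would be consulted when parsing responses; not used when building URLs.
    strict_validation: bool = True


def list_records_urls(config: HarvestConfig) -> List[str]:
    """Build the initial ListRecords request URL for each configured set."""
    urls = []
    for set_spec in config.sets:
        params = {"verb": "ListRecords", "metadataPrefix": config.metadata_prefix}
        if set_spec is not None:
            params["set"] = set_spec
        urls.append(config.base_url + "?" + urlencode(params))
    return urls


# Assumed example repository: the URL and set names are placeholders.
config = HarvestConfig(
    base_url="https://repository.example.edu/oai",
    sets=["maps", "photographs"],
)
urls = list_records_urls(config)
```

A real harvester would also follow the `resumptionToken` returned with each partial response and apply the configured validation strictness when parsing; both are omitted here for brevity.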
Each pipeline is customized by repository and composed of a number of XSLT stylesheets (five to eight) that are named in a configuration file. Only two XSLT stylesheets per repository are repository-specific. Repository-specific stylesheets mostly implement a subset of element-specific normalization and augmentation functions. To facilitate normalization, the stylesheets import a generic dictionary of XSLT templates and a number of data dictionaries encoded in XML: e.g., ISO 639 language codes, Dublin Core Metadata Initiative and CIC type vocabularies, a date data dictionary, ISO 3166 and ISO 3166-2:US geospatial codes, and a subset of Internet MIME Type (IMT) format string values.

The selected, normalized, and enriched metadata records are stored in a location separate from where the as-harvested metadata records are maintained. Periodically, a procedure is run to upload the enriched metadata records into the Microsoft SQL database. Another program applies an additional stylesheet and concatenates transformed metadata records as required for use in the DLXS-based service mentioned above. The concatenated files are then transferred to another server, where a shell script routine rebuilds the DLXS indexes. Specifics of the metadata record filtering, normalization, and enrichment tasks implemented for the CIC portal are described below in Sections 3 and 4.
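
The idea of a configurable, per-repository pipeline backed by data dictionaries can be sketched as follows. This is only an analogy in Python, with plain functions standing in for the XSLT stylesheets and a small mapping standing in for an ISO 639 data dictionary. The step names, repository names, the record structure, and the default type value are all hypothetical and are not taken from the CIC implementation.

```python
from typing import Callable, Dict, List

Record = Dict[str, List[str]]          # element name -> list of values
Step = Callable[[Record], Record]

# Tiny excerpt standing in for an ISO 639 data dictionary (illustrative only).
LANGUAGE_CODES = {"english": "eng", "en": "eng", "french": "fre", "fr": "fre"}


def normalize_language(record: Record) -> Record:
    """Generic step: map free-text language values to codes when recognized."""
    values = record.get("language", [])
    normalized = [LANGUAGE_CODES.get(v.strip().lower(), v) for v in values]
    return {**record, "language": normalized} if values else record


def drop_empty_values(record: Record) -> Record:
    """Generic step: remove blank values and elements left with no values."""
    cleaned = {k: [v for v in vals if v.strip()] for k, vals in record.items()}
    return {k: vals for k, vals in cleaned.items() if vals}


def add_default_type(record: Record) -> Record:
    """Hypothetical repository-specific step: supply a type when none is given."""
    return record if record.get("type") else {**record, "type": ["Image"]}


STEPS: Dict[str, Step] = {
    "drop_empty_values": drop_empty_values,
    "normalize_language": normalize_language,
    "add_default_type": add_default_type,
}

# Stand-in for the configuration file that names each repository's steps in order.
PIPELINES: Dict[str, List[str]] = {
    "repo_a": ["drop_empty_values", "normalize_language", "add_default_type"],
    "repo_b": ["drop_empty_values", "normalize_language"],
}


def run_pipeline(repository: str, record: Record) -> Record:
    """Apply the configured steps in order, each consuming the previous output."""
    for name in PIPELINES[repository]:
        record = STEPS[name](record)
    return record


harvested = {"title": ["Map of Illinois"], "language": ["English "], "type": [""]}
enriched = run_pipeline("repo_a", harvested)
```

Each step returns a new record rather than modifying its input, which mirrors the separation between as-harvested and enriched records described above. Keeping most steps generic and only a few repository-specific keeps per-repository maintenance small.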

Similar resources

Effective Metadata for Social Book Search from a User Perspective

In this extended abstract we describe our participation in the INEX 2014 Interactive Social Book Search Track. In previous work, we have looked at the impact of professional and user-generated metadata in the context of book search, and compared these different categories of metadata in terms of retrieval effectiveness. Here, we take a different approach and study the use of professional and us...

Metadata Enrichment for Automatic Data Entry Based on Relational Data Models

The idea of automatic generation of data entry forms based on data relational models is a common and known idea that has been discussed day by day more than before according to the popularity of agile methods in software development accompanying development of programming tools. One of the requirements of the automation methods, whether in commercial products or the relevant research projects, ...

Treatment of Eye Movement Desensitization and Reprocessing for Post-Traumatic Stress Disorder in Iran: A Systematic Review Study

Although a variety of treatments have been developed over the past decade to treat post-traumatic stress disorder (PTSD), one of the most effective psychological treatments to improve its symptoms is eye movement desensitization and reprocessing (EMDR). This study was performed with the aim to systematically review the studies which have focused on the effect of eye movement desensitization and...

Value-based metadata quality assessment

This article proposes a method that allows a value-based assessment of metadata quality and construction of a baseline quality model. The method is illustrated on a large-scale, aggregated collection of simple Dublin core metadata records. An analysis of the collection suggests that metadata providers and end users may have different value structures for the same metadata. To promote better use...

Integrated Trauma-Focused Cognitive-Behavioural Therapy for Post-traumatic Stress and Psychotic Symptoms: A Case-Series Study Using Imaginal Reprocessing Strategies

Despite high rates of trauma in individuals with psychotic symptoms, post-traumatic stress symptoms are frequently overlooked in clinical practice. There is also reluctance to treat post-traumatic symptoms in case the therapeutic procedure of reprocessing the trauma exacerbates psychotic symptoms. Recent evidence demonstrates that it is safe to use reprocessing strategies in this population. Ho...

Publication date: 2005